DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING **EXAMINATIONS 2006** 

MSc and EEE/ISE PART IV: MEng and ACGI

## SYNTHESIS OF DIGITAL ARCHITECTURES

Tuesday, 25 April 10:00 am

Time allowed: 3:00 hours

Corrected Copy

There are FIVE questions on this paper.

Answer THREE questions.

Correction & Q1, Q3

All questions carry equal marks

Any special instructions for invigilators and information for candidates are on page 1.

Examiners responsible

First Marker(s): G.A. Constantinides

Second Marker(s): K. Masselos

## SYNTHESIS OF DIGITAL ARCHITECTURES

## Notation

The following notation is used in this exam paper.

- $\mathbb{Z}$ : the set of integers.
- R: the set of reals.
- $S^k = \underbrace{S \times S \times ... \times S}_{k \text{ repetitions}}$ , for k > 0 and integer and S a set.

## The Questions

- 1. a) Define the term 'slicing floorplan', and provide one graphical example each of a slicing and a non-slicing floorplan. [4]
  - b) Define the term 'skewed slicing tree', and explain why a skewed slicing tree is a more appropriate representation of a floorplan than a generic slicing tree. [3]
  - c) Construct a skewed slicing tree representation, and its equivalent Polish expression, for the floorplan given in Fig. 1.1. [3]
  - d) By applying suitable annealing moves to the floorplan in Fig. 1.1, derive a floorplan with minimized area. Fully document each annealing move made. [7]
  - e) Integer Linear Programming is sometimes used for floorplan optimization. If the objective is to minimize floorplan area, and the chip aspect ratio is not fixed, explain why the area metric is problematic for the ILP methodology, and briefly describe a method to circumvent this problem within the framework of ILP. [3]



Figure 1.1 A floorplan

The presence of the meaningless Ds In Fig. 1 was announced to the conductes where at the store of the exam 2. The one variant of the word-length optimization problem can be cast as an optimization problem in the form of (2.1), where  $x \in \mathbb{Z}^n$  is a vector of integer word-lengths,  $f: \mathbb{Z}^n \to \mathbb{R}$  models the area of the implementation,  $g: \mathbb{Z}^n \to \mathbb{R}^k$  is a function with  $g_i(x) < 0$  iff the arithmetic error at output i meets a user-constraint, and vector inequalities are considered to be satisfied iff they are satisfied component-wise.

min: 
$$f(x)$$
  
subject to:  $g(x) \le 0$   
 $x \ge 0$   
 $x \text{ integer}$  (2.1)

- a) For circuits consisting only of registers, adders, and constant-coefficient multipliers, the area of each resource is proportional to its input word-length. Use this information to give a formula for f(x) in vector notation. [2]
- b) One error metric has the form given in (2.2). Briefly explain why this formulation is unsuitable for ILP solvers. [5]

$$g_i(x) = \sum_{j=1}^{n} c_{ij} 2^{-x_j} - b_i$$
 (2.2)

- If we know an upper bound  $\hat{x}_j$  for each word-length  $x_j$ , we may introduce binary variables  $y_{jw}$  for  $1 \le n \le \hat{x}_j$  with the interpretation that  $y_{jw} = 1$  iff the word-length of variable j is w bits. Write a linear constraint that expresses  $x_j$  in terms of the variables  $\{y_{jw}\}$ .
- d) Re-write (2.2) as a linear constraint in terms of  $\{y_{in}\}$ , and hence complete a formulation suitable for ILP solvers. [5]
- e) Comment on the algorithmic complexity of tackling the word-length optimization problem using your formulation. [5]

3. The discrete function approximation problem is to produce a real polynomial  $g(x) = c_0 \phi_0(x) + c_1 \phi_1(x) + \ldots + c_n \phi_n(x)$ , of order n, such that it is close to a given real function f(x) over the range  $x \in \{0, 1, 2, \ldots, 2^k - 1\}$ . Here  $\phi_i(x)$  is a polynomial of order i. We shall say that f(x) is close to g(x) when ||f - g|| is minimized, where  $||\cdot||$  denotes a suitable norm.

This question concerns the use of function approximation under two possible norms, the quadratic norm shown in (3.1), and the minimax norm shown in (3.2). One definition the inner product of such functions as in (3.3).

$$||h||_2 = \sum_{x=0}^{2^k - 1} h^2(x)$$
 (3.1)

$$||h||_{\infty} = \max_{x \in \{0, \dots, 2^k - 1\}} |h(x)|$$
 (3.2)

$$\langle f, g \rangle = \sum_{x=0}^{2^{k}-1} f(x)g(x)dx$$
 (3.3)

- a) Under the quadratic norm, derive an expression for  $c_i$ , the coefficient of  $\phi_i(x)$ , in terms of the inner product between f(x) and  $\phi_i(x)$ , stating any condition on the set of polynomials  $\{\phi_i(x)\}$  you have used. You do not need to provide the coefficients of the polynomials. [6]
- b) Show that under the minimax norm, the problem is equivalent to a linear program, explicitly stating the objective function, constraints, and variables in this case.
- c) An alternative to function evaluation through polynomial approximation is to use a direct ROM lookup. If the datapath is k bits wide throughout, write an expression for the area and the delay of the ROM lookup. [4]
- d) A k-bit multiplier and adder together consume the same area as  $10k^2$  ROM storage bits. The delay through a multiplier/adder serial combination scales proportionally to k, and when k = 1 is equal to half the delay through the equivalent ROM. Use this information to derive a relationship between k and n under which a Horner's scheme approach is (i) more area efficient, (ii) faster, than the ROM alternative. [4]

| 4. | This question relates to the code containing loop-carried dependencies shown Fig. 4.1. Read and writes are considered to take zero time, whereas multiplication is considered to take unit time. |                                                                                                                                                                                                                                                                         |
|----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    | a)                                                                                                                                                                                               | With reference to Fig. 4.1, briefly explain the term 'loop carried dependency'.                                                                                                                                                                                         |
|    | •                                                                                                                                                                                                | [2]                                                                                                                                                                                                                                                                     |
|    | b)                                                                                                                                                                                               | Draw a delay-weighted DFG for the while-loop in Fig. 4.1.                                                                                                                                                                                                               |
|    |                                                                                                                                                                                                  | [3]                                                                                                                                                                                                                                                                     |
|    | c)                                                                                                                                                                                               | By applying suitable retiming transformations, produce another delay-weighted DFG for which the clock period is minimal. Clearly explain each step of your working.                                                                                                     |
|    |                                                                                                                                                                                                  | [4]                                                                                                                                                                                                                                                                     |
|    | d)                                                                                                                                                                                               | The ILP formulation discussed for the retiming process is reproduced in (4.1). For the original DFG and for its re-timed version, state the minimal values of $s_{\nu}$ for all nodes $\nu$ in the delay-weighted DFG which satisfy the constraints of the formulation. |
|    |                                                                                                                                                                                                  | [4]                                                                                                                                                                                                                                                                     |
|    | e)                                                                                                                                                                                               | Explain the physical meaning of the constraint $w_r(u, v) \ge 0$ .                                                                                                                                                                                                      |
| 3  | f)                                                                                                                                                                                               | [1] Explain the purpose of the constant $N$ in (4.1).                                                                                                                                                                                                                   |
|    |                                                                                                                                                                                                  | [2]                                                                                                                                                                                                                                                                     |
|    | g)                                                                                                                                                                                               | Describe qualitatively the expected change in the objective function seen as $N$ varies over the range of $(-\infty,0]$ , explaining your answer.                                                                                                                       |

[4]

```
for i = 0 to 5 {
  a[i] := 0;
  b[i] := 0;
  \dot{c}[i] := 0;
  d[i] := 0;
  e[i] := 0;
}
while( true ) {
  read a[0];
  d[0] := c[4]*a[5];
  write d[0];
  e[0] := 2*d[0];
  b[0] := a[2]*e[0];
  c[0] := 5*b[0];
  for i = 4 downto 0 {
    a[i+1] := a[i];
    b[i+1] := b[i];
    c[i+1] := c[i];
    d[i+1] := d[i];
    e[i+1] := e[i];
}
```

Figure 4.1 Code containing loop-carried dependencies

```
min: L subject to: s_{\nu} \geq d(u) + x_{u,\nu}N, for all (u, \nu) \in E x_{u,\nu} \leq w_r(u,\nu), for all (u,\nu) \in E s_{\nu} + d(\nu) \leq L, for all \nu \in V w_r(u,\nu) = w(u,\nu) + r(\nu) - r(u), for all (u,\nu) \in E w_r(u,\nu) \geq 0, for all (u,\nu) \in E x_{u,\nu} \in \{0,1\}, for all (u,\nu) \in E
```

5. A new module selection heuristic has been proposed for the resource type set  $R = \{+, -, >, ALU\}$ , where an ALU can perform one addition, subtraction or comparison every two clock cycles. By contrast an addition, subtraction or comparison resource, +, -, and >, respectively, each has a latency of one cycle.

The type set of each operation can be described as:  $T(+) = \{+,ALU\}$ ,  $T(>) = \{>,ALU\}$ ,  $T(-) = \{-,ALU\}$ .

The module selection process can be summarized as follows. Initially, we assume that no operations are implemented in ALUs. Under this assumption, we calculate the critical path of the circuit. An arbitrary non-critical node is selected and an ALU module is selected for its implementation. We then iteratively repeat the calculation of critical path and selection of a non-critical node, until there are no non-critical nodes.

This question concerns the application of the above heuristic to the code shown in Fig. 5.1, which is required to execute in minimum latency.

- a) Given that no ALU resources are to be used, perform a latency-constrained list-scheduling and minimal resource binding, and complete datapath design (without register sharing) for this code.
- Use the module selection heuristic above to obtain an alternative datapath design, also using latency-constrained list scheduling and minimal resource binding.
- Assuming resource types +, and > each consume unit silicon area, whereas ALUs consume silicon area k, and assuming that the circuit is resource dominated, under what range of k is the design containing ALUs smaller? [2]
- d) If the circuit is not resource dominated due to multiplexer overheads, and k = 1.5, under what range of area for multiplexers is the design containing ALUs smaller? [2]
- e) By constructing example code, prove that the proposed algorithm does not always achieve the module selection corresponding to the optimal area solution if k > 1 (assuming latency-constrained list scheduling for minimum latency, optimum binding, and resource domination). [4]

```
begin

a := x + y

b := a + a

c := x - y

d := b > y

e := b + 2

f := e + b
```

Figure 5.1 Some code containing addition, subtraction and comparison operations

MODEL ANSWERS

SYNTHEOUS of DIGITAL ARCHITECTURES

repeated bisection of rectangular cells.

SUDCING

NON - SLIDBUG



b) A skewed slicing tree is one where no note and its right-child have the same type. Shered slising trees are caronical expression.

c)



124H3VH

d) i) SWAP H&3 - need to check if still sheved stiery tree still ii) SWAP 4&3 - no check regid.
We obtain 1234 HVH

1 3 4

No gapo => minimal area.

1.e) Height, H, and wilth W can be expressed linearly in tems of the variables used. However HW is nonlinear. A usual approach is to use

HW & H'W' + (H-H')W' + (W-W')H'

When H=H' and W=W'.

2. a) 
$$f(x) = c^{T}x$$

b) 
$$g_i(x)$$
 is nonlinear.  
 $g_i(xx) = \sum_{j=1}^{n} c_{ij} 2^{-\alpha x_j} - b_i$ 

c) 
$$x_j = \sum_{w=1}^{\hat{\alpha}_j} \hat{z}_{n} y_{jnw}$$

d) 
$$g_i(x) = \sum_{j=1}^{n} c_{ij} \sum_{j=1}^{n} 2^{-n} y_{j} w - b_i$$
  
Also require  $\sum_{j=1}^{n} y_{j} w = 1$   $\forall_j$  the captete the  $w = 1$   $\forall_j$  the captete the  $v = 1$   $v = 1$ 

e) Worst-cae complexity of ILP is emporential in poster size (or easity shows).

Public size here is both #von & # constraints.

Complexity is exponential in # program variables
Unt's more, compared to the original
formulation, compresity also scales with
bounds on variable length.

3/10

(1) 
$$\langle \phi_i, \phi_i \rangle = 1$$

(2) 
$$\langle \phi_i, \phi_j \rangle = 0$$
 for  $i \neq j$   
(i.e. ortho-norm)

Then 
$$C_i = \langle f, \rho_i \rangle$$

Proof:

Gnor = 
$$||f(x) - \sum_{i=0}^{n} c_i \phi_i(x)||_2$$

$$= \sum_{j=0}^{2^{k-1}} \left[ f(x) - \sum_{j=0}^{n} C_{j} \phi_{j}(x) \right]^{2}$$

$$= \sum_{j=0}^{2^{n}-1} j^{2}(x) - 2 \sum_{i=0}^{n} \frac{2^{n}-1}{j^{2}} (x) \phi_{i}(x) + \sum_{j=0}^{n} \sum_{j=0}^{n} \frac{2^{n}-1}{j^{2}} \sum_{j=0}^{n} \frac{2^{n}-1}{j^{2}} (x) \phi_{i}(x) + \sum_{j=0}^{n} \sum_{j=0}^{n} \frac{2^{n}-1}{j^{2}} (x) \phi_{i}(x) + \sum_{j=0}^{n} \frac{2^{n}-1}$$

$$= \frac{2^{n-1}}{\sum_{j=0}^{2^{n}} f^{2}(x)} - 2 \frac{n}{\sum_{i=0}^{2^{n}} c_{i}} \langle f, \phi_{i} \rangle + \sum_{i=0}^{n} c_{i}^{2^{n}} \langle \phi_{i}, \phi_{i} \rangle$$

$$\frac{\partial \mathcal{E}_{vor}}{\partial c_i} = -2 < f, \phi_i > + 2 c_i < \phi_i, \phi_i >$$

liveouring,

Min W

s.t. YX & {0, ..., 2 - 1}

 $-W \leq J(\alpha) - \sum_{i=0}^{N} c_i \phi_i(x) \leq W$ 

Get contraint is linear in the varibles, which are {C;}.

- C) Assuming constructed from K 1-5it // Cookups,

  Area = K C, K 2 K

  Delay = C2 K
- d) MAC over =  $10\kappa^2 c_1$ MAC delay =  $\frac{1}{2}C_2 K$

(i) 10kg/n < g/k2 K  $n < \frac{1}{10k}2^{K}$ 

(ii) \( \frac{1}{2}C\_2KN \left\) \( C\_2K \)
\( \text{ => } \ N \left\) \( 2 \)









